This paper handles a class of strategic games called potential games and develops a novel learning
algorithm, Payoff-based Inhomogeneous Partially Irrational Play (PIPIP). The algorithm is based
on Distributed Inhomogeneous Synchronous Learning (DISL) presented in an existing work but, unlike
DISL, PIPIP allows agents to make irrational decisions with a specified probability, i.e. agents can choose
an action with a low utility from the past actions stored in the memory. Thanks to the irrational decisions,
we can prove convergence in probability of the collective actions to potential function maximizers. Finally,
we demonstrate the effectiveness of the algorithm through experiments on a sensor coverage
problem. The demonstration reveals that the learning algorithm successfully leads
agents to the neighborhood of potential function maximizers even in the presence of undesirable Nash equilibria.
We also see through an experiment with a moving density function that PIPIP adapts to
environmental changes.
I. Introduction

Cooperative control of multi-agent systems basically aims at designing local interactions of
agents in order to meet some global objective of the group [1], [2]. Depending on the scenario,
agents are also required to achieve the global objective under imperfect prior knowledge of the
environment while adapting to network and environmental changes. Nevertheless, conventional
cooperative control schemes do not always embody such functions. For example, in sensor
deployment or coverage, most control schemes, as in [3], [4], [5], assume prior knowledge
of a density function defined over the mission space and hence are hardly applicable to missions
over unknown surroundings. A game theoretic framework as in [6] holds tremendous potential
for overcoming this drawback of the conventional schemes.

A game theoretic approach to cooperative control formulates the problems as non-cooperative
games and identifies the objective in cooperative control with arrival at some specific Nash
equilibria [6], [7], [8]. In particular, it is shown by J. Marden et al. [6] that a variety of cooperative
control problems are related to so-called potential games [9]. Unlike other game theoretic frameworks,
potential games give a design perspective, which consists of two kinds of design problems: utility
design and learning algorithm design [10]. The objective of utility design is to align the local utility
functions to be maximized by each agent so that the resulting game constitutes a potential game,
where the literature [11], [12] provides general design methodologies. The learning algorithm
design determines action selection rules of agents so that the actions converge to Nash equilibria.

In this paper, we focus on the learning algorithm design for cooperative control of multi-agent
systems. Many learning algorithms have been established in the game theory literature, and
some algorithms have recently been developed, mainly by J. Marden and his collaborators. The algorithms
therein are classified into several categories depending on their features.

The first issue is whether an algorithm presumes finite or infinite memory. For example,
Fictitious Play (FP) [13], Regret Matching (RM) [14], Joint Strategy Fictitious Play (JSFP)
with Inertia [15] and Regret-Based Dynamics [16] require an infinite number of memories for
executing the algorithms. Meanwhile, Adaptive Play (AP) [17], Better Reply Process with Finite
Memory and Inertia [18], (Restrictive) Spatial Adaptive Play ((R)SAP) [19], [6], Payoff-based
Dynamics (PD) [20], the Payoff-based version of Log-Linear Learning (PLLL) [21] and Distributed
Inhomogeneous Synchronous Learning (DISL) [7] require only a finite number of memories. Of
course, the finite memory algorithms are preferable for practical applications.

The second issue is what information is necessary for executing learning algorithms. For
example, FP presumes that all the information on the other agents' actions is available, which
strongly restricts its applications. On the other hand, RM, JSFP with Inertia and (R)SAP assume
availability of a so-called virtual payoff, i.e. the utility which would be obtained if an agent
chose an action. In contrast, PD, PLLL and DISL utilize only the actual payoffs obtained after
taking actions, which has the potential to overcome the aforementioned drawback of the sensor
coverage schemes [7].

The main objective of standard game theory is to compute Nash equilibria and hence most
of the above algorithms, except for [6], [21], assure only convergence to pure Nash equilibria.
However, in most cooperative control problems, this is insufficient for achieving the global
objective and selection of the most efficient equilibria is required [2]. In this paper, we thus
deal with convergence of the actions to the Nash equilibria maximizing the potential function 
which are called optimal Nash equilibria in this paper, since the potential function is usually 
designed in many cooperative control problems so that its maximizers coincide with the action 
profiles achieving the global objectives. 

The primary contribution of this paper is to develop a novel learning algorithm called
Payoff-based Inhomogeneous Partially Irrational Play (PIPIP). The learning algorithm is based on
DISL presented in [7] and inherits several of its desirable features: (i) the algorithm requires only
a small, finite memory, (ii) the algorithm is payoff-based, (iii) the algorithm allows agents to
choose actions in a synchronous fashion at each iteration, (iv) the action selection procedure in
PIPIP consists of simple rules, and (v) the algorithm is capable of dealing with constraints on action
selection. The main difference of PIPIP from DISL is that agents are allowed to make irrational decisions
with a certain probability, which gives agents opportunities to escape from undesirable Nash
equilibria. Thanks to the irrational decisions, PIPIP assures that the actions of the group converge
in probability to optimal Nash equilibria, whereas only convergence to a pure Nash equilibrium
is proved in [7]. Meanwhile, some learning algorithms dealing with convergence
to the optimal Nash equilibria have been presented, as in [6], [21], and we mention the advantages of
PIPIP over these learning algorithms in the following. RSAP [6] guarantees convergence of the
distribution of actions to a stationary distribution such that the probability of staying at the optimal
Nash equilibria is arbitrarily specified by a design parameter. However, RSAP is not synchronous
and is virtual payoff-based, and hence its applications are restricted. PLLL [21] also allows irrational
and exploratory decisions similarly to PIPIP and the resulting conclusion is almost compatible
with this paper. However, in [21], how to handle the action constraints is not explicitly shown
and convergence in probability to the optimal Nash equilibria is not proved in a strict sense.

The secondary contribution of this paper is to demonstrate the effectiveness of the present
learning algorithm through experiments on a sensor coverage problem, where the learning
algorithm is applied to a robotic system compensated by local controllers and logics. Such
investigations have not been sufficiently addressed in the existing works. Here, we mainly
check the performance of the learning algorithm in finite time and its adaptability to environmental
changes. In order to deal with the former issue, we place obstacles in the mission space to
generate apparent undesirable Nash equilibria. Then, we compare the performance of PIPIP
with that of DISL. The results therein will support our claim that what this paper provides is not a
minor extension of [7] and contains a significant contribution from a practical point of view.
We next demonstrate the adaptability by employing a moving density function defined over the
mission space. Though adaptation to a time-varying density is in principle expected for
payoff-based algorithms, its demonstration has not been addressed in previous works. We see from
the results that desirable group behaviors, i.e. tracking of the moving high density region, are
achieved by PIPIP even in the absence of any knowledge of the density.

This paper is organized as follows: In Section II, we give some terminology and the basics
necessary for stating the results of this paper. In Section III, we present the learning algorithm
PIPIP and state the main result associated with the algorithm, i.e. convergence in probability to
the optimal Nash equilibria. Then, Section IV gives the proof of the main result. In Section V,
we demonstrate the effectiveness of PIPIP through experiments on a sensor coverage problem.
Finally, Section VI draws conclusions.

II. Preliminary 

A. Constrained Potential Games 

In this paper, we consider a constrained strategic game
$\Gamma = (\mathcal{V}, \mathcal{A}, \{U_i(\cdot)\}_{i \in \mathcal{V}}, \{\mathcal{R}_i(\cdot)\}_{i \in \mathcal{V}})$.
Here, $\mathcal{V} := \{1, \dots, n\}$ is the set of agents' unique identifiers. The set $\mathcal{A}$ is called a collective
action set and defined as $\mathcal{A} := \mathcal{A}_1 \times \cdots \times \mathcal{A}_n$, where $\mathcal{A}_i$, $i \in \mathcal{V}$ is the set of actions which agent
$i$ can take. The function $U_i : \mathcal{A} \to \mathbb{R}$ is a so-called utility function of agent $i \in \mathcal{V}$ and each
agent basically behaves so as to maximize the function. The function $\mathcal{R}_i : \mathcal{A}_i \to 2^{\mathcal{A}_i}$ provides
a so-called constrained action set and $\mathcal{R}_i(a_i)$ is the set of actions which agent $i$ will be able to
take in case he takes an action $a_i$. Namely, at each iteration $t \in \mathbb{Z}_+ := \{0, 1, 2, \dots\}$, each agent
chooses an action $a_i(t)$ from the set $\mathcal{R}_i(a_i(t-1))$.

Throughout this paper, we denote the collection of actions of the agents other than agent $i$ by

$$ a_{-i} := (a_1, \dots, a_{i-1}, a_{i+1}, \dots, a_n). $$

Then, the joint action $a = (a_1, \dots, a_n) \in \mathcal{A}$ is described as $a = (a_i, a_{-i})$. Let us now make the
following assumptions.

Assumption 1 The function $\mathcal{R}_i : \mathcal{A}_i \to 2^{\mathcal{A}_i}$ satisfies the following three conditions.

• (Reversibility [6]) For any $i \in \mathcal{V}$ and any actions $a_i^1, a_i^2 \in \mathcal{A}_i$, the inclusion $a_i^2 \in \mathcal{R}_i(a_i^1)$ is
equivalent to $a_i^1 \in \mathcal{R}_i(a_i^2)$.

• (Feasibility [6]) For any $i \in \mathcal{V}$ and any actions $a_i^1, a_i^m \in \mathcal{A}_i$, there exists a sequence of
actions $a_i^1 \to a_i^2 \to \cdots \to a_i^m$ satisfying $a_i^l \in \mathcal{R}_i(a_i^{l-1})$ for all $l \in \{2, \dots, m\}$.

• For any $i \in \mathcal{V}$ and any action $a_i \in \mathcal{A}_i$, the number of available actions in $\mathcal{R}_i(a_i)$ is greater
than or equal to 3.

Assumption 2 For any $(a, a')$ satisfying $a_i' \in \mathcal{R}_i(a_i)$ and $a_{-i} = a_{-i}'$, the inequality
$U_i(a') - U_i(a) < 1$ holds true for all $i \in \mathcal{V}$.

Assumption 2 means that when only one agent changes his action, the difference in the utility
function $U_i$ should be smaller than 1. This assumption is satisfied by just scaling all agents'
utility functions appropriately.

Let us now introduce the potential games under consideration in this paper. 

Definition 1 (Constrained Potential Games [6], [7]) A constrained strategic game $\Gamma$ is said
to be a constrained potential game with potential function $\phi : \mathcal{A} \to \mathbb{R}$ if for all $i \in \mathcal{V}$, every
$a_i \in \mathcal{A}_i$ and every $a_{-i} \in \prod_{j \neq i} \mathcal{A}_j$, the following equation holds for every $a_i' \in \mathcal{R}_i(a_i)$.

$$ U_i(a_i', a_{-i}) - U_i(a_i, a_{-i}) = \phi(a_i', a_{-i}) - \phi(a_i, a_{-i}) \tag{1} $$

Throughout this paper, we suppose that a potential function $\phi$ is designed so that its maximizers
coincide with the joint actions $a$ achieving a global objective of the group. Under this situation,
(1) implies that if an agent changes his action, the change of the local objective function is equal
to that of the group objective function.
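
As an illustration of Definition 1, the following sketch checks equation (1) by brute force on a small coverage-like game. The game itself is our own assumption and is not taken from the experiments of this paper: a hypothetical ring of six cells with hypothetical weights, agents that may stay or move to an adjacent cell, a potential equal to the total weight of the covered cells, and a utility equal to the weight of an agent's own cell unless another agent already covers it. All names (`CELLS`, `WEIGHT`, `constrained_set`, `potential`, `utility`, `satisfies_equation_1`) are illustrative.

```python
from itertools import product

CELLS = 6                                   # hypothetical ring of six cells
WEIGHT = [0.5, 0.1, 0.3, 0.1, 0.2, 0.1]     # hypothetical density per cell


def constrained_set(cell):
    """R_i(a_i): stay, or move to one of the two adjacent cells on the ring."""
    return {(cell - 1) % CELLS, cell, (cell + 1) % CELLS}


def potential(a):
    """phi(a): total weight of the cells covered by at least one agent."""
    return sum(WEIGHT[c] for c in set(a))


def utility(i, a):
    """U_i(a): weight of agent i's cell, or 0 if another agent covers it too."""
    others = a[:i] + a[i + 1:]
    return 0.0 if a[i] in others else WEIGHT[a[i]]


def satisfies_equation_1(n_agents, util, tol=1e-12):
    """Check (1) for every joint action, agent and admissible deviation."""
    for a in product(range(CELLS), repeat=n_agents):
        for i in range(n_agents):
            for alt in constrained_set(a[i]):
                b = a[:i] + (alt,) + a[i + 1:]
                lhs = util(i, b) - util(i, a)
                rhs = potential(b) - potential(a)
                if abs(lhs - rhs) > tol:
                    return False
    return True


# The marginal-contribution utility is aligned with the potential ...
print(satisfies_equation_1(2, utility))
# ... whereas a utility that ignores overlaps is not, for the same potential.
print(satisfies_equation_1(2, lambda i, a: WEIGHT[a[i]]))
```

Since all weights are smaller than 1, this toy game also satisfies Assumption 2 without any rescaling.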

We next define the Nash equilibria as below. 

Definition 2 (Constrained Nash Equilibria) For a constrained strategic game $\Gamma$, a collection
of actions $a^* \in \mathcal{A}$ is said to be a constrained pure Nash equilibrium if the following equation
holds for all $i \in \mathcal{V}$.

$$ U_i(a_i^*, a_{-i}^*) = \max_{a_i \in \mathcal{R}_i(a_i^*)} U_i(a_i, a_{-i}^*) \tag{2} $$

It is known [7], [6] that any constrained potential game has at least one pure Nash equilibrium
and, in particular, a potential function maximizer is a Nash equilibrium, which is called an optimal
Nash equilibrium in this paper. However, there may exist undesirable pure Nash equilibria not
maximizing the potential function. In order to reach the optimal Nash equilibria while avoiding
undesirable equilibria, we have to appropriately design a learning algorithm which determines
how to select an action at each iteration.
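
The existence of undesirable equilibria can be seen on a small example. The sketch below enumerates the constrained pure Nash equilibria of a hypothetical two-agent game on a ring of six cells (our own toy construction, not the experimental setup of this paper) and separates the potential function maximizers from the remaining equilibria. The names `CELLS`, `WEIGHT`, `constrained_set`, `potential`, `utility` and `is_constrained_nash` are illustrative assumptions.

```python
from itertools import product

CELLS = 6                                   # hypothetical ring of six cells
WEIGHT = [0.5, 0.1, 0.3, 0.1, 0.2, 0.1]     # hypothetical density per cell


def constrained_set(cell):
    """R_i(a_i): stay, or move to one of the two adjacent cells on the ring."""
    return {(cell - 1) % CELLS, cell, (cell + 1) % CELLS}


def potential(a):
    """phi(a): total weight of the cells covered by at least one agent."""
    return sum(WEIGHT[c] for c in set(a))


def utility(i, a):
    """U_i(a): weight of agent i's cell, or 0 if another agent covers it too."""
    others = a[:i] + a[i + 1:]
    return 0.0 if a[i] in others else WEIGHT[a[i]]


def is_constrained_nash(a):
    """Equation (2): no agent gains by a unilateral admissible deviation."""
    for i in range(len(a)):
        best = max(utility(i, a[:i] + (alt,) + a[i + 1:])
                   for alt in constrained_set(a[i]))
        if utility(i, a) < best:
            return False
    return True


joint_actions = list(product(range(CELLS), repeat=2))
nash = [a for a in joint_actions if is_constrained_nash(a)]
phi_max = max(potential(a) for a in joint_actions)
optimal = [a for a in nash if abs(potential(a) - phi_max) < 1e-12]

for a in nash:
    label = "optimal" if a in optimal else "undesirable"
    print(a, round(potential(a), 3), label)
```

In this toy game only the equilibria covering cells 0 and 2 maximize the potential; the equilibria covering cells 0 and 4, or 2 and 4, are local maxima from which no single agent can profitably deviate.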

B. Resistance Tree 

Let us consider a Markov process $\{P_t^0\}$ defined over a finite state space $\mathcal{X}$. A perturbation
of $\{P_t^0\}$ is a Markov process whose transition probabilities are slightly perturbed. Specifically,
a perturbed Markov process $\{P_t^\varepsilon\}$, $\varepsilon \in [0, 1]$ is defined as a process such that the transition of
$\{P_t^\varepsilon\}$ follows $\{P_t^0\}$ with probability $1 - \varepsilon$ and does not follow it with probability $\varepsilon$. Then, we
introduce a notion of regular perturbation as below.

Definition 3 (Regular Perturbation [22]) A family of stochastic processes $\{P_t^\varepsilon\}$ is called a
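regular perturbation of $\{P_t^0\}$ if the following conditions are satisfied:

(A1) For some $\varepsilon^* > 0$, the process $\{P_t^\varepsilon\}$ is irreducible and aperiodic for all $\varepsilon \in (0, \varepsilon^*]$.
(A2) Let us denote by $P_{xy}^\varepsilon$ the transition probability from $x \in \mathcal{X}$ to $y \in \mathcal{X}$ along with the
Markov process $\{P_t^\varepsilon\}$. Then, $\lim_{\varepsilon \to 0} P_{xy}^\varepsilon = P_{xy}^0$ holds for all $x, y \in \mathcal{X}$.
(A3) If $P_{xy}^\varepsilon > 0$ for some $\varepsilon$, then there exists a real number $\chi(x \to y) \geq 0$ such that

$$ \lim_{\varepsilon \to 0} \frac{P_{xy}^\varepsilon}{\varepsilon^{\chi(x \to y)}} \in (0, \infty), \tag{3} $$

where $\chi(x \to y)$ is called the resistance of the transition from $x$ to $y$.

Remark that, from (A1), if $\{P_t^\varepsilon\}$ is a regular perturbation of $\{P_t^0\}$, then $\{P_t^\varepsilon\}$ has the unique
stationary distribution $\mu(\varepsilon)$ for each $\varepsilon > 0$.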

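We next introduce the resistance $\lambda(r)$ of a path $r$ from $x \in \mathcal{X}$ to $x' \in \mathcal{X}$ along with transitions
$x^{(0)} = x \to x^{(1)} \to \cdots \to x^{(m)} = x'$ as the value satisfying

$$ \lim_{\varepsilon \to 0} \frac{P^\varepsilon(r)}{\varepsilon^{\lambda(r)}} \in (0, \infty), \tag{4} $$

where $P^\varepsilon(r)$ denotes the probability of the sequence of transitions. Then, it is easy to confirm
that $\lambda(r)$ is simply given by

$$ \lambda(r) = \sum_{i=0}^{m-1} \chi(x^{(i)} \to x^{(i+1)}). \tag{5} $$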

A state $x \in \mathcal{X}$ is said to communicate with state $y \in \mathcal{X}$ if both $x \to y$ and $y \to x$ hold,
where the notation $x \to y$ implies that $y$ is accessible from $x$, i.e. a process starting at state $x$
has non-zero probability of transitioning into $y$ at some point. A recurrent communication class
is a class such that every pair of states in the class communicates with each other and no state
outside the class is accessible from the class. Now, let $H_1, \dots, H_J$ be the recurrent communication
classes of the Markov process $\{P_t^0\}$. Then, within each class, there is a path with zero resistance
from every state to every other. In case of a perturbed Markov process $\{P_t^\varepsilon\}$, there may exist
several paths from states in $H_l$ to states in $H_k$ for any two distinct recurrent communication
classes $H_l$ and $H_k$. The minimal resistance among all such paths is denoted by $\chi_{lk}$.

Let us now define a weighted complete directed graph $G = (\mathcal{H}, \mathcal{H} \times \mathcal{H}, \mathcal{W})$ over the recurrent
communication classes $\mathcal{H} = \{H_1, \dots, H_J\}$, where the weight $w_{lk} \in \mathcal{W}$ of each edge $(H_l, H_k)$
is equal to the minimal resistance $\chi_{lk}$. We next define an $l$-tree, which is a spanning tree over $G$
with a root node $H_l \in \mathcal{H}$. We also denote by $\mathcal{G}(l)$ the set of all $l$-trees. The resistance of an
$l$-tree is the sum of the weights on all the edges of the tree. The stochastic potential of the
recurrent communication class $H_l$ is the minimal resistance among all $l$-trees in $\mathcal{G}(l)$. We also
introduce the notion of stochastically stable state as below.

Definition 4 (Stochastically Stable State [19]) A state $x \in \mathcal{X}$ is said to be stochastically
stable if $x$ satisfies $\lim_{\varepsilon \to 0+} \mu_x(\varepsilon) > 0$, where $\mu_x(\varepsilon)$ is the value of the element of the stationary
distribution $\mu(\varepsilon)$ corresponding to state $x$.

Using the above terminology, we introduce the following well-known result which connects
the stochastically stable states and the stochastic potential.

Proposition 1 [19] Let $\{P_t^\varepsilon\}$ be a regular perturbation of $\{P_t^0\}$. Then $\lim_{\varepsilon \to 0+} \mu(\varepsilon)$ exists and the
limiting distribution $\mu(0)$ is a stationary distribution of $\{P_t^0\}$. Moreover, the stochastically stable
states are contained in the recurrent communication classes with minimum stochastic potential.
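
For a small number of recurrent communication classes, the stochastic potential can be computed by brute force. The sketch below is a minimal illustration under our own assumptions: the function names are hypothetical, the matrix `chi` of minimal resistances is invented for the example, and an $l$-tree is encoded by letting every non-root class pick one outgoing edge such that following the edges always ends at the root.

```python
from itertools import product


def reaches_root(v, parent, root):
    """Follow the chosen outgoing edges from v; True if the walk ends at root."""
    seen = set()
    while v != root:
        if v in seen:          # a cycle that avoids the root: not a tree
            return False
        seen.add(v)
        v = parent[v]
    return True


def stochastic_potentials(chi):
    """Minimal l-tree resistance for every root l, by exhaustive enumeration.

    chi[l][k] is the minimal resistance from class l to class k.
    """
    J = len(chi)
    result = []
    for root in range(J):
        others = [v for v in range(J) if v != root]
        best = float("inf")
        # Every non-root class selects exactly one outgoing edge.
        for choice in product(range(J), repeat=len(others)):
            parent = dict(zip(others, choice))
            if any(v == p for v, p in parent.items()):
                continue       # self-loops are not edges of a tree
            if all(reaches_root(v, parent, root) for v in others):
                best = min(best, sum(chi[v][parent[v]] for v in others))
        result.append(best)
    return result


# Hypothetical resistances between three classes (diagonal entries unused).
chi = [
    [0.0, 1.1, 2.3],
    [1.0, 0.0, 1.2],
    [2.0, 1.0, 0.0],
]
potentials = stochastic_potentials(chi)
print(potentials)
print("class with minimum stochastic potential:", potentials.index(min(potentials)))
```

By Proposition 1, only states in the class attaining the minimum can be stochastically stable; in this invented example that is class 0. The enumeration grows as $J^{J-1}$ and is meant only for very small $J$.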

C. Ergodicity 

Discrete-time Markov processes can be divided into two types: time-homogeneous and
time-inhomogeneous, where a Markov process $\{P_t\}$ is said to be time-homogeneous if the transition
matrix denoted by $P_t$ is independent of the time and to be time-inhomogeneous if it is time
dependent. We also denote the probability of the state transition from time $k_0$ to time $k$ by
$P(k_0, k) = \prod_{t=k_0}^{k-1} P_t$, $0 \leq k_0 < k$.

For a Markov process {Pt}, we introduce the notion of ergodicity. 

Definition 5 (Strong Ergodicity [23]) A Markov process $\{P_t\}$ is said to be strongly ergodic if
there exists a stochastic vector $\mu^*$ such that for any distribution $\mu$ on $\mathcal{X}$ and time $k_0$, we have

$$ \lim_{k \to \infty} \mu P(k_0, k) = \mu^*. $$

Definition 6 (Weak Ergodicity [23]) A Markov process $\{P_t\}$ is said to be weakly ergodic if
the following equation holds.

$$ \lim_{k \to \infty} \left( P_{xz}(k_0, k) - P_{yz}(k_0, k) \right) = 0 \quad \forall x, y, z \in \mathcal{X},\ \forall k_0 \in \mathbb{Z}_+ $$

If $\{P_t\}$ is strongly ergodic, the distribution $\mu$ converges to the unique distribution $\mu^*$ from any
initial state. Weak ergodicity implies that the information on the initial state vanishes as time
increases, though convergence of $\mu$ may not be guaranteed. Note that the notions of weak and
strong ergodicity are equivalent in case of time-homogeneous Markov processes.
We finally introduce the following well-known results on ergodicity. 

Proposition 2 [23] A Markov process $\{P_t\}$ is strongly ergodic if the following conditions hold:
(B1) The Markov process $\{P_t\}$ is weakly ergodic.
(B2) For each $t$, there exists a stochastic vector $\mu^t$ on $\mathcal{X}$ such that $\mu^t$ is the left eigenvector
of the transition matrix $P(t)$ with eigenvalue 1.
(B3) The eigenvectors $\mu^t$ in (B2) satisfy $\sum_{t=0}^{\infty} \sum_{x \in \mathcal{X}} |\mu_x^t - \mu_x^{t+1}| < \infty$. Moreover, if
$\mu^* = \lim_{t \to \infty} \mu^t$, then $\mu^*$ is the vector in Definition 5.

III. Learning Algorithm and Main Result

In this section, we present a learning algorithm called Payoff-based Inhomogeneous Partially
Irrational Play (PIPIP) and state the main result of this paper. At each iteration $t \in \mathbb{Z}_+$,
the learning algorithm chooses an action according to the following procedure, assuming that
each agent $i \in \mathcal{V}$ stores the previous two chosen actions $a_i(t-2), a_i(t-1)$ and the outcomes
$U_i(a(t-2)), U_i(a(t-1))$. Each agent first updates a parameter $\varepsilon$ called the exploration rate by

$$ \varepsilon(t) = t^{-\frac{1}{n(D+1)}}, \tag{6} $$

where $D$ is defined as $D := \max_{i \in \mathcal{V}} D_i$ and $D_i$ is the minimal number of steps required for
transitioning between any two actions of agent $i$.
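
To get a feeling for how slowly the exploration rate decays, the following sketch evaluates the update rule, assuming that (6) reads $\varepsilon(t) = t^{-1/(n(D+1))}$. The helper names `exploration_rate` and `iterations_needed` and the sample values of $n$ and $D$ are our own illustrative choices.

```python
def exploration_rate(t, n_agents, D):
    """epsilon(t) = t ** (-1 / (n (D + 1))), assumed form of (6); requires t >= 1."""
    return t ** (-1.0 / (n_agents * (D + 1)))


def iterations_needed(eps_target, n_agents, D):
    """Approximate iteration count at which epsilon(t) drops to eps_target."""
    return eps_target ** (-(n_agents * (D + 1)))


# Hypothetical game sizes: the exponent n (D + 1) dominates the decay speed.
for n_agents, D in [(2, 1), (4, 3), (10, 5)]:
    t_needed = iterations_needed(0.1, n_agents, D)
    print(f"n={n_agents}, D={D}: about {t_needed:.3g} iterations to reach 0.1")
```

Under this assumption, reaching $\varepsilon(t) = 0.1$ takes on the order of $10^{4}$, $10^{16}$ and $10^{60}$ iterations for the three hypothetical sizes, which illustrates why a heuristic schedule or a constant small exploration rate may be needed in practice.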

Then, each agent compares the values of $U_i(a(t-1))$ and $U_i(a(t-2))$. If $U_i(a(t-1)) \geq
U_i(a(t-2))$ holds, then he chooses action $a_i(t)$ according to the rule:

• $a_i(t)$ is randomly chosen from $\mathcal{R}_i(a_i(t-1)) \setminus \{a_i(t-1)\}$ with probability $\varepsilon(t)$ (it is called
an exploratory decision).

• $a_i(t) = a_i(t-1)$ with probability $1 - \varepsilon(t)$.

Otherwise ($U_i(a(t-1)) < U_i(a(t-2))$), action $a_i(t)$ is chosen according to the rule:

• $a_i(t)$ is randomly chosen from $\mathcal{R}_i(a_i(t-1)) \setminus \{a_i(t-1), a_i(t-2)\}$ with probability $\varepsilon(t)$
(it is called an exploratory decision).

• $a_i(t) = a_i(t-1)$ with probability

$$ (1 - \varepsilon(t)) \left( \kappa \cdot \varepsilon(t)^{\Delta_i} \right), \quad \Delta_i := U_i(a(t-2)) - U_i(a(t-1)) \tag{7} $$

(it is called an irrational decision).

• $a_i(t) = a_i(t-2)$ with probability

$$ (1 - \varepsilon(t)) \left( 1 - \kappa \cdot \varepsilon(t)^{\Delta_i} \right). \tag{8} $$

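The parameter $\kappa$ should be chosen so as to satisfy

$$ \kappa \in \left( \frac{1}{C - 1}, \frac{1}{2} \right], \quad C := \max_{i \in \mathcal{V}} \max_{a_i \in \mathcal{A}_i} |\mathcal{R}_i(a_i)|, \tag{9} $$

where $|\mathcal{R}_i(a_i)|$ is the number of elements of the set $\mathcal{R}_i(a_i)$. It is clear under the third item of
Assumption 1 that the action $a_i(t)$ is well-defined.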
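Algorithm 1 Payoff-based Inhomogeneous Partially Irrational Play (PIPIP)

Initialization: Action $a$ is chosen randomly from $\mathcal{A}$. Set $a_i^1 \leftarrow a_i$, $a_i^2 \leftarrow a_i$, $U_i^1 \leftarrow U_i(a)$,
$U_i^2 \leftarrow U_i(a)$, $\Delta_i \leftarrow 0$ for all $i \in \mathcal{V}$ and $t \leftarrow 2$.

Step 1: $\varepsilon \leftarrow t^{-\frac{1}{n(D+1)}}$.

Step 2: If $U_i^1 \geq U_i^2$, then set

$$ a_i^{tmp} \leftarrow \begin{cases} \mathrm{rnd}(\mathcal{R}_i(a_i^1) \setminus \{a_i^1\}), & \text{w.p. } \varepsilon \\ a_i^1, & \text{w.p. } 1 - \varepsilon \end{cases} $$

Otherwise, set

$$ a_i^{tmp} \leftarrow \begin{cases} \mathrm{rnd}(\mathcal{R}_i(a_i^1) \setminus \{a_i^1, a_i^2\}), & \text{w.p. } \varepsilon \\ a_i^1, & \text{w.p. } (1 - \varepsilon)(\kappa \cdot \varepsilon^{\Delta_i}) \\ a_i^2, & \text{w.p. } (1 - \varepsilon)(1 - \kappa \cdot \varepsilon^{\Delta_i}) \end{cases} $$

Step 3: Execute the selected action $a_i^{tmp}$ and receive $U_i^{tmp} \leftarrow U_i(a^{tmp})$.

Step 4: Set $a_i^2 \leftarrow a_i^1$, $a_i^1 \leftarrow a_i^{tmp}$, $U_i^2 \leftarrow U_i^1$, $U_i^1 \leftarrow U_i^{tmp}$ and $\Delta_i \leftarrow U_i^2 - U_i^1$.

Step 5: $t \leftarrow t + 1$ and go to Step 1.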



Finally, each agent $i$ executes the selected action $a_i(t)$ and computes the resulting utility
$U_i(a(t))$ via feedback from the environment and neighboring agents. At the next iteration, agents
repeat the same procedure.
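
The procedure described above can be sketched in a few lines of code. The following is a minimal illustration and not the implementation used in the experiments of this paper: the ring-shaped toy game (`CELLS`, `WEIGHT`, `constrained_set`, `utility`), the function names `pipip_select` and `run`, and the values of the exploration rate and of `kappa` are all our own assumptions. In particular, `kappa` is treated as a free parameter here and should respect (9) in an actual design.

```python
import random

CELLS = 6                                   # hypothetical ring of six cells
WEIGHT = [0.5, 0.1, 0.3, 0.1, 0.2, 0.1]     # hypothetical density per cell


def constrained_set(cell):
    """R_i(a_i): stay, or move to one of the two adjacent cells on the ring."""
    return {(cell - 1) % CELLS, cell, (cell + 1) % CELLS}


def utility(i, a):
    """U_i(a): weight of agent i's cell, or 0 if another agent covers it too."""
    others = a[:i] + a[i + 1:]
    return 0.0 if a[i] in others else WEIGHT[a[i]]


def pipip_select(a1, a2, u1, u2, eps, kappa, rng):
    """Action selection of one agent; (a1, u1) is the most recent pair."""
    if u1 >= u2:
        if rng.random() < eps:                       # exploratory decision
            return rng.choice(sorted(constrained_set(a1) - {a1}))
        return a1
    delta = u2 - u1                                  # utility lost by the last move
    r = rng.random()
    if r < eps:                                      # exploratory decision
        return rng.choice(sorted(constrained_set(a1) - {a1, a2}))
    if r < eps + (1.0 - eps) * kappa * eps ** delta:
        return a1                                    # irrational decision
    return a2                                        # return to the better action


def run(a_init, eps_of_t, kappa, steps, seed=0):
    """Synchronous play of all agents; returns the final joint action."""
    rng = random.Random(seed)
    n = len(a_init)
    a1 = a2 = tuple(a_init)
    u1 = [utility(i, a1) for i in range(n)]
    u2 = list(u1)
    for t in range(2, steps + 2):
        eps = eps_of_t(t)
        tmp = tuple(pipip_select(a1[i], a2[i], u1[i], u2[i], eps, kappa, rng)
                    for i in range(n))
        u_tmp = [utility(i, tmp) for i in range(n)]  # actual payoffs only
        a2, a1 = a1, tmp
        u2, u1 = u1, u_tmp
    return a1


# Constant exploration rate, i.e. the homogeneous variant, from a suboptimal start.
print(run((4, 0), lambda t: 0.1, kappa=0.4, steps=2000))
```

Each agent uses only its own two stored actions and the two payoffs it actually received, which is what makes the rule payoff-based. Passing `lambda t: t ** (-1.0 / (n * (D + 1)))` as `eps_of_t`, with user-supplied `n` and `D`, would correspond to the time-varying exploration rate.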

The algorithm PIPIP is compactly described in Algorithm 1, where the function $\mathrm{rnd}(\mathcal{A}')$
outputs an action chosen randomly from the set $\mathcal{A}'$. Note that the algorithm with a constant
$\varepsilon(t) = \varepsilon \in (0, 1/2]$ is called Payoff-based Homogeneous Partially Irrational Play (PHPIP),
which will be used for the proof of the main result of this paper.

PIPIP is developed based on the learning algorithm DISL presented in [7]. The main difference
of PIPIP from DISL is that agents may choose the action with the lower utility in Step 2 with
probability $(1 - \varepsilon)(\kappa \cdot \varepsilon^{\Delta_i})$, which depends on the difference of the last two steps' utilities $\Delta_i$ and
the parameters $\kappa$ and $\varepsilon$. Thanks to the irrational decisions, agents can escape from undesirable
Nash equilibria, as will be proved in the next section.

We are now ready to state the main result of this paper. Before mentioning it, we define

$$ \mathcal{B} := \{(a, a') \in \mathcal{A} \times \mathcal{A} \mid a_i' \in \mathcal{R}_i(a_i)\ \forall i \in \mathcal{V}\} \tag{10} $$

and $\zeta(\Gamma)$ as the set of the optimal Nash equilibria, i.e. potential function maximizers, of a
constrained potential game $\Gamma$.

Theorem 1 Consider a constrained potential game $\Gamma$ satisfying Assumptions 1 and 2. Suppose
that each agent behaves according to Algorithm 1. Then, a Markov process $\{P_t\}$ is defined over
the space $\mathcal{B}$ and the following equation is satisfied.

$$ \lim_{t \to \infty} \mathrm{Prob}\left[ z(t) \in \mathrm{diag}(\zeta(\Gamma)) \right] = 1, \tag{11} $$

where $z(t) := (a(t-1), a(t))$ and $\mathrm{diag}(\mathcal{A}') = \{(a, a) \in \mathcal{A} \times \mathcal{A} \mid a \in \mathcal{A}'\}$, $\mathcal{A}' \subseteq \mathcal{A}$.

Equation (11) means that the probability that agents executing PIPIP take one of the potential
function maximizers converges to 1. The proof of this theorem will be shown in the next section.
In PIPIP, the parameter $\varepsilon(t)$ is updated by (6) to prove the above theorem, which is the same
as in DISL. However, this update rule takes a long time to reach a sufficiently small $\varepsilon(t)$ when the
size of the game, i.e. $n(D+1)$, is large. Thus, from the practical point of view, we might have to
decrease $\varepsilon(t)$ based on heuristics or use PHPIP with a sufficiently small $\varepsilon$. Even in such cases,
the following theorem at least holds, similarly to the paper [20].

Theorem 2 Consider a constrained potential game $\Gamma$ satisfying Assumptions 1 and 2. Suppose
that each agent behaves according to PHPIP. Then, given any probability $p < 1$, if the exploration
rate $\varepsilon$ is sufficiently small, for all sufficiently large time $t \in \mathbb{Z}_+$, the following equation holds.

$$ \mathrm{Prob}\left[ z(t) \in \mathrm{diag}(\zeta(\Gamma)) \right] > p. \tag{12} $$

Theorem 2 assures that the optimal actions are eventually selected with high probability as long
as the final value of $\varepsilon(t)$ is sufficiently small, irrespective of the decay rate of $\varepsilon(t)$.

IV. Proof of Main Result 

In this section, we prove the main result of this paper (Theorem 1). We first consider PHPIP
with a constant exploration rate $\varepsilon$. The state $z(t) = (a(t-1), a(t))$ for PHPIP with $\varepsilon$ constitutes
a perturbed Markov process $\{P_t^\varepsilon\}$ on the state space $\mathcal{B} = \{(a, a') \in \mathcal{A} \times \mathcal{A} \mid a_i' \in \mathcal{R}_i(a_i)\ \forall i \in \mathcal{V}\}$.

In terms of the Markov process $\{P_t^\varepsilon\}$ induced by PHPIP, the following lemma holds.

Lemma 1 The Markov process $\{P_t^\varepsilon\}$ induced by PHPIP applied to a constrained potential game
$\Gamma$ is a regular perturbation of $\{P_t^0\}$ under Assumption 1.
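Proof: Consider a feasible transition $z^1 \to z^2$ with $z^1 = (a^0, a^1) \in \mathcal{B}$ and $z^2 = (a^1, a^2) \in \mathcal{B}$
and partition the set of agents $\mathcal{V}$ according to their behaviors along with the transition as

$$ \Lambda_1 = \{i \in \mathcal{V} \mid U_i(a^1) \geq U_i(a^0),\ a_i^2 \in \mathcal{R}_i(a_i^1) \setminus \{a_i^1\}\}, $$
$$ \Lambda_2 = \{i \in \mathcal{V} \mid U_i(a^1) \geq U_i(a^0),\ a_i^2 = a_i^1\}, $$
$$ \Lambda_3 = \{i \in \mathcal{V} \mid U_i(a^1) < U_i(a^0),\ a_i^2 \in \mathcal{R}_i(a_i^1) \setminus \{a_i^0, a_i^1\}\}, $$
$$ \Lambda_4 = \{i \in \mathcal{V} \mid U_i(a^1) < U_i(a^0),\ a_i^2 = a_i^1\}, $$
$$ \Lambda_5 = \{i \in \mathcal{V} \mid U_i(a^1) < U_i(a^0),\ a_i^2 = a_i^0\}. $$

Then, the probability of the transition $z^1 \to z^2$ is represented as

$$ P^\varepsilon_{z^1 z^2} = \prod_{i \in \Lambda_1} \frac{\varepsilon}{|\mathcal{R}_i(a_i^1)| - 1} \times \prod_{i \in \Lambda_2} (1 - \varepsilon) \times \prod_{i \in \Lambda_3} \frac{\varepsilon}{|\mathcal{R}_i(a_i^1)| - h_i} \times \prod_{i \in \Lambda_4} (1 - \varepsilon) \kappa \varepsilon^{\Delta_i} \times \prod_{i \in \Lambda_5} (1 - \varepsilon)(1 - \kappa \varepsilon^{\Delta_i}), \tag{13} $$

where $h_i = 1$ if $a_i^0 = a_i^1$ and $h_i = 2$ otherwise. We see from (13) that the resistance of the transition
$z^1 \to z^2$ defined in (3) is given by $|\Lambda_1| + |\Lambda_3| + \sum_{i \in \Lambda_4} \Delta_i$ since

$$ 0 < \lim_{\varepsilon \to 0} \frac{P^\varepsilon_{z^1 z^2}}{\varepsilon^{|\Lambda_1| + |\Lambda_3| + \sum_{i \in \Lambda_4} \Delta_i}} = \kappa^{|\Lambda_4|} \prod_{i \in \Lambda_1} \frac{1}{|\mathcal{R}_i(a_i^1)| - 1} \prod_{i \in \Lambda_3} \frac{1}{|\mathcal{R}_i(a_i^1)| - h_i} < \infty \tag{14} $$

holds. Thus, (A3) in Definition 3 is satisfied. In addition, it is straightforward from the procedure
of PHPIP to confirm the condition (A2).

It is thus sufficient to check (A1) in Definition 3. From the rule of taking exploratory actions
in Algorithm 1 and the second item of Assumption 1, we immediately see that the set of
the states accessible from any $z \in \mathcal{B}$ is equal to $\mathcal{B}$. This implies that the perturbed Markov
process $\{P_t^\varepsilon\}$ is irreducible. We next check aperiodicity of $\{P_t^\varepsilon\}$. It is clear that any state in
$\mathrm{diag}(\mathcal{A}) = \{(a, a) \in \mathcal{A} \times \mathcal{A} \mid a \in \mathcal{A}\}$ has period 1. Let us next pick any $(a^0, a^1)$ from the set
$\mathcal{B} \setminus \mathrm{diag}(\mathcal{A})$. Since $a_i^0 \in \mathcal{R}_i(a_i^1)$ holds iff $a_i^1 \in \mathcal{R}_i(a_i^0)$ from Assumption 1, the following two
paths are both feasible: $(a^0, a^1) \to (a^1, a^0) \to (a^0, a^1)$ and $(a^0, a^1) \to (a^1, a^1) \to (a^1, a^0) \to (a^0, a^1)$.
Since these two cycles have lengths 2 and 3, the period of state $(a^0, a^1)$ is 1 and the process $\{P_t^\varepsilon\}$ is proved to be aperiodic.
Hence the process $\{P_t^\varepsilon\}$ is both irreducible and aperiodic, which means (A1) in Definition 3.

In summary, conditions (A1)-(A3) in Definition 3 are satisfied and the proof is completed. ■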
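
The resistance computed in the proof can be checked numerically. The sketch below is only an illustration under our own encoding: each agent is described by a hypothetical tuple `(category, r_size, h, delta)`, where `category` in 1..5 stands for the set $\Lambda_1, \dots, \Lambda_5$ the agent belongs to, `r_size` for $|\mathcal{R}_i(a_i^1)|$, and `h`, `delta` for $h_i$ and $\Delta_i$. The function names and the sample transition are assumptions.

```python
def transition_probability(agents, eps, kappa):
    """Product form of the transition probability, one factor per agent."""
    p = 1.0
    for category, r_size, h, delta in agents:
        if category == 1:                    # explores after a non-decrease
            p *= eps / (r_size - 1)
        elif category == 2:                  # keeps the action after a non-decrease
            p *= 1.0 - eps
        elif category == 3:                  # explores after a decrease
            p *= eps / (r_size - h)
        elif category == 4:                  # irrational decision
            p *= (1.0 - eps) * kappa * eps ** delta
        elif category == 5:                  # returns to the previous action
            p *= (1.0 - eps) * (1.0 - kappa * eps ** delta)
        else:
            raise ValueError("category must be in 1..5")
    return p


def resistance(agents):
    """|Lambda_1| + |Lambda_3| + sum of Delta_i over Lambda_4."""
    explorers = sum(1 for agent in agents if agent[0] in (1, 3))
    irrational = sum(agent[3] for agent in agents if agent[0] == 4)
    return explorers + irrational


# Hypothetical transition: one explorer, one agent staying, one irrational agent.
agents = [(1, 3, 1, 0.0), (2, 3, 1, 0.0), (4, 3, 2, 0.3)]
kappa = 0.4
chi = resistance(agents)
for eps in (1e-2, 1e-4, 1e-6):
    ratio = transition_probability(agents, eps, kappa) / eps ** chi
    print(f"eps={eps:g}: P / eps^chi = {ratio:.6f}")
```

For this invented transition the ratio approaches $\kappa / 2 = 0.2$ as $\varepsilon \to 0$, a finite positive limit as required by (A3).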
From Lemma 1, the perturbed Markov process $\{P_t^\varepsilon\}$ is irreducible and hence there exists a
unique stationary distribution $\mu(\varepsilon)$ for every $\varepsilon$. Moreover, because $\{P_t^\varepsilon\}$ is a regular perturbation
of $\{P_t^0\}$, we see from the former half of Proposition 1 that $\lim_{\varepsilon \to 0+} \mu(\varepsilon)$ exists and the limiting
distribution $\mu(0)$ is the stationary distribution of $\{P_t^0\}$.

We also have the following lemma on the Markov process $\{P_t^\varepsilon\}$ induced by PHPIP.

Lemma 2 Consider the Markov process $\{P_t^\varepsilon\}$ induced by PHPIP applied to a constrained
potential game $\Gamma$. Then, the recurrent communication classes $\{H_i\}$ of the unperturbed Markov
process $\{P_t^0\}$ are given by the elements of $\mathrm{diag}(\mathcal{A}) = \{(a, a) \in \mathcal{A} \times \mathcal{A} \mid a \in \mathcal{A}\}$, namely

$$ H_i = \{(a^i, a^i)\}, \quad i \in \{1, \dots, |\mathcal{A}|\}. \tag{15} $$

Proof: Because of the rule at Step 2 of PHPIP, it is clear that any state belonging to $\mathrm{diag}(\mathcal{A})$
cannot move to another state without explorations, which implies that each state in $\mathrm{diag}(\mathcal{A})$
itself forms a recurrent communication class of the unperturbed Markov process $\{P_t^0\}$.

We next consider the states in $\mathcal{B} \setminus \mathrm{diag}(\mathcal{A})$ and prove that these states are never included
in recurrent communication classes of the unperturbed Markov process $\{P_t^0\}$. Here, we use
induction. We first consider the case of $n = 1$. If $U_1(a_1^1) \geq U_1(a_1^0)$, then the transition $(a_1^0, a_1^1) \to
(a_1^1, a_1^1)$ is taken. Otherwise, a sequence of transitions $(a_1^0, a_1^1) \to (a_1^1, a_1^0) \to (a_1^0, a_1^0)$ occurs. Thus,
in case of $n = 1$, the state $(a_1^0, a_1^1) \in \mathcal{B} \setminus \mathrm{diag}(\mathcal{A})$ is never included in recurrent communication
classes of $\{P_t^0\}$.

We next make a hypothesis that there exists a $k \in \mathbb{Z}_+$ such that all the states in $\mathcal{B} \setminus \mathrm{diag}(\mathcal{A})$
are not included in recurrent communication classes of the unperturbed Markov process $\{P_t^0\}$
for all $n \leq k$. Then, we consider the case $n = k + 1$, where there are three possible cases:

(i) $U_i(a^1) \geq U_i(a^0)\ \forall i \in \mathcal{V} = \{1, \dots, k+1\}$,

(ii) $U_i(a^1) < U_i(a^0)\ \forall i \in \mathcal{V} = \{1, \dots, k+1\}$,

(iii) $U_i(a^1) \geq U_i(a^0)$ for $l$ agents, where $l \in \{1, \dots, k\}$.

In case (i), the transition $(a^0, a^1) \to (a^1, a^1)$ must occur for $\varepsilon = 0$ and, in case (ii), the transition
$(a^0, a^1) \to (a^1, a^0) \to (a^0, a^0)$ should be selected. Thus, all the states in $\mathcal{B} \setminus \mathrm{diag}(\mathcal{A})$ satisfying (i)
or (ii) are never included in recurrent communication classes. In case (iii), at the next iteration,
all the agents $i$ satisfying $U_i(a^1) \geq U_i(a^0)$ choose the current action. Then, such agents possess a
single action in the memory and, in case of $\varepsilon = 0$, each agent has to choose either of the actions
in the memory. Namely, these agents never change their actions in all subsequent iterations. The
resulting situation is thus the same as the case of $n = k + 1 - l$. From the above hypothesis,
we can conclude that the states in case (iii) are also not included in recurrent communication
classes. In summary, the states in $\mathcal{B} \setminus \mathrm{diag}(\mathcal{A})$ are never included in the recurrent communication
classes of $\{P_t^0\}$. The proof is thus completed. ■

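A feasible path over the process $\{P_t^\varepsilon\}$ from $z \in \mathcal{B}$ to $z' \in \mathcal{B}$ is especially said to be a route
if both of the two nodes $z$ and $z'$ are elements of $\mathrm{diag}(\mathcal{A}) \subset \mathcal{B}$. Note that a route is a path and
hence the resistance of the route is also given by (5). Especially, we define a straight route as
follows, where we use the notation

$$ \mathcal{E}_{\rm single} := \{(z = (a, a), z' = (a', a')) \in \mathrm{diag}(\mathcal{A}) \times \mathrm{diag}(\mathcal{A}) \mid \exists i \in \mathcal{V} \text{ s.t. } a_i \in \mathcal{R}_i(a_i'),\ a_i \neq a_i' \text{ and } a_{-i} = a_{-i}'\}. \tag{16} $$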

Definition 7 (Straight Route) A route between any two states $z^0 = (a^0, a^0)$ and $z^1 = (a^1, a^1)$
in $\mathrm{diag}(\mathcal{A})$ such that $(z^0, z^1) \in \mathcal{E}_{\rm single}$ is said to be a straight route if the path is given by the
transitions on the Markov process $\{P_t^\varepsilon\}$ such that only one agent $i$ changes his action from $a_i^0$
to $a_i^1$ at the first iteration and the explored agent $i$ selects the same action $a_i^1$ at the next iteration,
while the other agents choose the same action $a_{-i}^0 = a_{-i}^1$ during the two steps.

In terms of the straight route, we have the following lemma. 

Lemma 3 Consider paths from any state $z^0 = (a^0, a^0) \in \mathrm{diag}(\mathcal{A})$ to any state $z^1 = (a^1, a^1) \in
\mathrm{diag}(\mathcal{A})$ such that $(z^0, z^1) \in \mathcal{E}_{\rm single}$ over the Markov process $\{P_t^\varepsilon\}$ induced by PHPIP applied
to a constrained potential game $\Gamma$. Then, under Assumption 2, the resistance $\lambda(r)$ of the straight
route $r$ from $z^0$ to $z^1$ is strictly smaller than 2 and the resistance $\lambda(r)$ is minimal among all
paths from $z^0$ to $z^1$.

Proof: Along the straight route, only one agent $i$ first changes his action from $a_i^0$ to $a_i^1$,
whose probability is given by

$$ (1 - \varepsilon)^{n-1} \times \frac{\varepsilon}{|\mathcal{R}_i(a_i^0)| - 1}. \tag{17} $$

It is easy to confirm from (17) that the resistance of the transition $(a^0, a^0) \to (a^0, a^1)$ is equal
to 1. We next consider the transition from $(a^0, a^1)$ to $(a^1, a^1)$. If $U_i(a^1) \geq U_i(a^0)$ is true, the
probability of this transition is given by $(1 - \varepsilon)^n$, whose resistance is equal to 0. Otherwise,
$U_i(a^1) < U_i(a^0)$ holds and the probability of this transition is given by $(1 - \varepsilon)^n \times \kappa \varepsilon^{\Delta_i}$, whose
resistance is $\Delta_i$. Let us now notice that the resistance $\lambda(r)$ of the straight route $r$ is equal to
the sum of the resistances of the transitions $(a^0, a^0) \to (a^0, a^1)$ and $(a^0, a^1) \to (a^1, a^1)$ from (5)
and that $\Delta_i < 1$ from Assumption 2. We can thus conclude that $\lambda(r)$ is smaller than 2. It is
also easy to confirm that the resistance of paths such that more than one agent takes an exploratory
action is at least 2. Namely, the straight route gives the smallest resistance among
all paths from $z^0 = (a^0, a^0)$ to $z^1 = (a^1, a^1)$ and hence the proof is completed. ■
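
The resistance of a straight route is easy to evaluate: it is 1 for the exploratory step plus the utility drop of the exploring agent when the move is a loss, and by (1) this drop equals the drop of the potential. The sketch below applies this to a short chain of consecutive straight routes using (5). The function names and the sample potential values are our own hypothetical choices.

```python
def straight_route_resistance(value_before, value_after):
    """1 for the exploration plus the utility drop, if the move is a loss.

    The arguments are the exploring agent's utilities (equivalently, by (1),
    the potential values) before and after the single-agent move.
    """
    delta = value_before - value_after
    if delta >= 1.0:
        raise ValueError("Assumption 2 requires utility differences below 1")
    return 1.0 + max(0.0, delta)


def route_resistance(values):
    """Sum of the straight-route resistances along consecutive vertices, as in (5)."""
    return sum(straight_route_resistance(values[k], values[k + 1])
               for k in range(len(values) - 1))


# Hypothetical potential values at three consecutive vertices of a route.
phi = [0.8, 0.5, 0.7]
forward = route_resistance(phi)           # from the first vertex to the last
backward = route_resistance(phi[::-1])    # the same vertices in reverse order
print(forward, backward, forward - backward)
```

In this example the difference between the forward and the backward resistance equals the potential difference between the two end vertices, so leaving the vertex with the higher potential is the more costly direction.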

We also introduce the following notion. 

Definition 8 (m-Straight-Route) An $m$-straight-route is a route which passes through $m$
vertices in $\mathrm{diag}(\mathcal{A})$ and in which all the routes between consecutive vertices are straight.

In terms of the route, we can prove the following lemma, which clarifies a connection between 
the potential function and the resistance of the route. 

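Lemma 4 Consider the Markov process $\{P_t^{\varepsilon}\}$ induced by PHPIP applied to a constrained potential game $\Gamma$. Let us denote an m-straight-route $r$ over $\{P_t^{\varepsilon}\}$ from state $z^0 = (a^0, a^0) \in \mathrm{diag}(\mathcal{A})$ to state $z^1 = (a^1, a^1) \in \mathrm{diag}(\mathcal{A})$ by

$$z^0 = z^{(0)} \Rightarrow z^{(1)} \Rightarrow \cdots \Rightarrow z^{(m-1)} = z^1,$$

where $z^{(i)} = (a^{(i)}, a^{(i)}) \in \mathrm{diag}(\mathcal{A})$, $i \in \{0, \cdots, m-1\}$ and all the arrows between them are straight routes. In addition, we denote its reverse route $r'$ by

$$z^0 = z^{(0)} \Leftarrow z^{(1)} \Leftarrow \cdots \Leftarrow z^{(m-1)} = z^1,$$

which is also an m-straight-route from $z^1$ to $z^0$. Then, under Assumption 2, if $\phi(a^0) > \phi(a^1)$, we have $\lambda(r) > \lambda(r')$.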

Proof: We suppose that the route $r$ contains $p$ straight routes with resistance greater than 1 and $r'$ contains $q$ straight routes with resistance greater than 1. Let us now denote the exploring agent along the route $z^{(i)} \Rightarrow z^{(i+1)}$ by $j_i$ and that along $z^{(i)} \Leftarrow z^{(i+1)}$ by $j'_i$. From the proof of Lemma 3, the resistance of the route $z^{(i)} \Rightarrow z^{(i+1)}$ is exactly equal to 1 (in case of $U_{j_i}(a^{(i+1)}) \geq U_{j_i}(a^{(i)})$) or equal to $1 + \Delta_{j_i} \in (1, 2)$ (in case of $U_{j_i}(a^{(i+1)}) < U_{j_i}(a^{(i)})$). From the definition of the potential function $\phi$, the following equation holds.

$$\Delta_{j_i} = U_{j_i}(a^{(i)}) - U_{j_i}(a^{(i+1)}) = \phi(a^{(i)}) - \phi(a^{(i+1)}) = U_{j'_i}(a^{(i)}) - U_{j'_i}(a^{(i+1)}) = -\Delta'_{j'_i}. \quad (20)$$

Namely, one of the resistances of the straight routes $z^{(i)} \Rightarrow z^{(i+1)}$ and $z^{(i)} \Leftarrow z^{(i+1)}$ is exactly 1 and the other is greater than 1, except for the case $U_{j_i}(a^{(i+1)}) = U_{j_i}(a^{(i)})$, in which the resistances are both equal to 1. An illustrative example of the relation is given as follows, where the numbers put on arrows are the resistances of the routes.

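$$z^0 = z^{(0)} \overset{1+\Delta_{j_0}}{\Longrightarrow} z^{(1)} \overset{1}{\Longrightarrow} z^{(2)} \overset{1+\Delta_{j_2}}{\Longrightarrow} z^{(3)} \overset{1}{\Longrightarrow} \cdots \overset{1}{\Longrightarrow} z^{(m-3)} \overset{1}{\Longrightarrow} z^{(m-2)} \overset{1+\Delta_{j_{m-2}}}{\Longrightarrow} z^{(m-1)} = z^1$$

$$z^0 = z^{(0)} \overset{1}{\Longleftarrow} z^{(1)} \overset{1+\Delta'_{j'_1}}{\Longleftarrow} z^{(2)} \overset{1}{\Longleftarrow} z^{(3)} \overset{1+\Delta'_{j'_3}}{\Longleftarrow} \cdots \Longleftarrow z^{(m-3)} \overset{1+\Delta'_{j'_{m-3}}}{\Longleftarrow} z^{(m-2)} \overset{1}{\Longleftarrow} z^{(m-1)} = z^1$$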

Namely, the inequality $p + q \leq m - 1$ holds true. Let us now collect all the $\Delta_{j_i}$ such that the resistance of $z^{(i)} \Rightarrow z^{(i+1)}$ is greater than 1 and number them as $\Delta_1, \Delta_2, \cdots, \Delta_p$. Similarly, we define $\Delta'_1, \Delta'_2, \cdots, \Delta'_q$ for the reverse route $r'$. Then, from the equations in (20), we obtain

$$\Delta_1 + \Delta_2 + \cdots + \Delta_p - (\Delta'_1 + \Delta'_2 + \cdots + \Delta'_q) = \phi(a^0) - \phi(a^1). \quad (21)$$

Note that (21) holds even in the presence of pairs $(a^{(i)}, a^{(i+1)})$ such that $U_{j_i}(a^{(i+1)}) = U_{j_i}(a^{(i)})$. Since the resistance of a route is the sum of the resistances of its transitions, we have $\Delta_1 + \cdots + \Delta_p = \lambda(r) - (m - 1)$ and $\Delta'_1 + \cdots + \Delta'_q = \lambda(r') - (m - 1)$, and hence we obtain

$$\lambda(r) = \lambda(r') + \phi(a^0) - \phi(a^1). \quad (22)$$

It is straightforward from (22) to prove the statement in the lemma. ■
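The relation between route resistance and potential difference can be checked numerically on a toy route. The Python sketch below is an illustration under stated assumptions: the potential values are hypothetical, each straight route is assigned resistance 1 plus the positive part of the potential drop (following the proof of Lemma 3), and the bound of 1 on the drop stands in for Assumption 2. The helper name `route_resistance` is introduced here only for this example.

```python
import math


def route_resistance(phi_values):
    """Resistance of an m-straight-route visiting joint actions whose
    potential values are phi_values, in the order they are visited."""
    total = 0.0
    for phi_cur, phi_next in zip(phi_values, phi_values[1:]):
        delta = phi_cur - phi_next  # utility loss of the exploring agent
        if delta >= 1.0:
            raise ValueError("utility differences are assumed smaller than 1")
        # Each straight route costs 1, plus delta if the utility decreases.
        total += 1.0 + max(delta, 0.0)
    return total


# Hypothetical potential values along a 4-straight-route from a^0 to a^1.
phi_values = [0.9, 0.4, 0.7, 0.2]
forward = route_resistance(phi_values)        # route r
reverse = route_resistance(phi_values[::-1])  # reverse route r'

# The difference of the two resistances equals phi(a^0) - phi(a^1).
print(forward, reverse, forward - reverse)
print(math.isclose(forward - reverse, phi_values[0] - phi_values[-1]))
```

Because the start of the toy route has the larger potential, the forward route comes out more resistant than its reverse, in line with the statement of the lemma.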

Let us form the weighted digraph $G$ over the recurrent communication classes for the Markov process $\{P_t^{\varepsilon}\}$ induced by PHPIP as in Subsection III-B, where the weight $w_{lk}$ of each edge $(H_l, H_k)$ is equal to the minimal resistance $\chi_{lk}$ among all the paths connecting two recurrent communication classes $H_l$ and $H_k$. From Lemma 2, the nodes of the graph $G$ are given by the elements of the set $\mathrm{diag}(\mathcal{A})$ and hence $G = (\mathrm{diag}(\mathcal{A}), \mathcal{E}, \mathcal{W})$, $\mathcal{E} \subseteq \mathrm{diag}(\mathcal{A}) \times \mathrm{diag}(\mathcal{A})$. Since all the recurrent communication classes have only one element as in (15), the weight $w_{lk}$ for any two states $z^l, z^k \in \mathrm{diag}(\mathcal{A})$ is simply given by the minimal resistance among all paths from $z^l$ to $z^k$. In addition, Lemma 3 proves that if $(z^l, z^k) \in \mathcal{E}_{single}$, the weight $w_{lk} = \chi_{lk}$ is given by the resistance of the straight route from $z^l$ to $z^k$.

Let us focus on $l$-trees over $G$ whose root is a state $z^l \in \mathrm{diag}(\mathcal{A})$. Recall now that the resistance of a tree is the sum of the weights of all the edges constituting the tree as defined in Subsection III-B. Then, we have the following lemma on the stochastic potential of $z^l$, which is the minimal resistance among all $l$-trees in $G(l)$.



July 26, 2011 DRAFT 




Fig. 1. Image of Kruskal's Algorithm 

Lemma 5 Consider the weighted directed graph $G$ constituted from the Markov process $\{P_t^{\varepsilon}\}$ induced by PHPIP applied to a constrained potential game $\Gamma$. Let us denote by $T = (\mathrm{diag}(\mathcal{A}), \mathcal{E}_l, \mathcal{W})$ the $l$-tree giving the stochastic potential of $z^l \in \mathrm{diag}(\mathcal{A})$. If Assumptions 1 and 2 are satisfied, then the edge set $\mathcal{E}_l$ must be a subset of $\mathcal{E}_{single}$.

Proof: The edges of $G$, denoted by $\mathcal{E}$, are divided into two classes: $\mathcal{E}_s := \mathcal{E}_{single}$ and $\mathcal{E}_d := \mathcal{E} \setminus \mathcal{E}_s$. From Lemma 3, the weights of the edges in $\mathcal{E}_s$ are smaller than 2. We next consider the weights of the edges in $\mathcal{E}_d$. Because of the nature of PHPIP, no agent can change his action to another one without exploration when $z(t) \in \mathrm{diag}(\mathcal{A})$, and hence exploration should be executed more than twice in order for the transition along an edge in $\mathcal{E}_d$ to occur. This implies that the weights of the edges in $\mathcal{E}_d$ should be greater than 2.

Hereafter, we simply rewrite the weights of the edges in $\mathcal{E}_s$ as $w_s (< 2)$ and those of the edges in $\mathcal{E}_d$ as $w_d (> 2)$ and build the minimal resistance tree with root $z^l$ over this simplified graph. Note that this simplification does not change the elements of the edge set $\mathcal{E}_l$. It should be noted that, from Assumption 1, all recurrent communication classes ($\mathrm{diag}(\mathcal{A})$) can be connected by passing only through straight routes. From the procedure of Kruskal's Algorithm, edges with resistance $w_d$ are never chosen as edges of the minimal tree, as illustrated in Fig. 1. Thus, the tree giving the stochastic potential must consist only of the edges in $\mathcal{E}_s$, which completes the proof. ■
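The role of Kruskal's Algorithm in this argument can be seen on a small example. The Python sketch below is an undirected analogue only: the trees in the proof are directed and rooted, and the graph, its weights, and the function name `kruskal` are hypothetical choices made for this illustration. It shows that when the light edges (weight below 2) already connect every node, the greedy procedure never needs a heavy edge.

```python
def kruskal(num_nodes, edges):
    """Classical Kruskal's Algorithm on an undirected graph.

    edges is a list of (weight, u, v) tuples; returns the chosen edges.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges):  # scan edges from light to heavy
        root_u, root_v = find(u), find(v)
        if root_u != root_v:            # edge joins two components
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


# Hypothetical graph: light edges play the role of E_s, heavy ones of E_d.
light = [(1.0, 0, 1), (1.5, 1, 2), (1.2, 2, 3)]
heavy = [(2.5, 0, 2), (2.1, 0, 3), (3.0, 1, 3)]
tree = kruskal(4, light + heavy)
print(tree)
print(all(weight < 2 for weight, _, _ in tree))
```

Since the edges are scanned in increasing order of weight, every heavy edge is examined only after the light edges have merged all nodes into one component, so it is always rejected.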

We are now ready to state the following proposition on the stochastically stable states (Definition 4) for the Markov process $\{P_t^{\varepsilon}\}$.
Fig. 2. Resistance Trees (the left tree should have a greater resistance than the right)

Proposition 3 Consider $\{P_t^{\varepsilon}\}$ induced by PHPIP applied to a constrained potential game $\Gamma$. If Assumptions 1 and 2 are satisfied, then the stochastically stable states are included in $\mathrm{diag}(C(\Gamma))$, where $C(\Gamma)$ is the set of the optimal Nash equilibria.

Proof: From Proposition 1 and Lemmas 1 and 2, it is sufficient to prove that the states in $\mathrm{diag}(\mathcal{A})$ with the minimal stochastic potential over $G$ are included in $\mathrm{diag}(C(\Gamma))$.

Let us introduce the notations $z_{nonopt} = (a_{nonopt}, a_{nonopt}) \in \mathrm{diag}(\mathcal{A})$ with a non-optimal action $a_{nonopt}$ and $z_{opt} = (a_{opt}, a_{opt}) \in \mathrm{diag}(\mathcal{A})$ with an optimal Nash equilibrium $a_{opt}$. If $z_{nonopt}$ is the root of a tree $T$, there exists a unique route $r$ from $z_{opt}$ to $z_{nonopt}$ over $T$. From Lemma 5, the route $r$ is an m-straight-route for some $m$. Now, we can build a tree $T'$ with root $z_{opt}$ such that only the route $r$ is replaced by its reverse route $r'$ (Fig. 2). Then, we have $\lambda(r) > \lambda(r')$ from Lemma 4 since $\phi(a_{opt}) > \phi(a_{nonopt})$. Thus, the resistance of $T'$ is smaller than that of $T$, and the stochastic potential of $z_{opt}$, which is no greater than the resistance of $T'$, is smaller than the resistance of $T$. The statement holds regardless of the selection of $a_{nonopt}$. This completes the proof. ■

We next consider PIPIP with time-varying $\varepsilon(t)$ and prove strong ergodicity of $\{P_t^{\varepsilon}\}$.

Lemma 6 The Markov process $\{P_t^{\varepsilon}\}$ induced by PIPIP applied to a constrained potential game $\Gamma$ is strongly ergodic.

Proof: We use Proposition 2 for the proof. Conditions (B2) and (B3) in Proposition 2 can be proved in the same way as in [7]. We thus show only that Condition (B1) is satisfied. As in (13), the probability of the transition $z^1 \to z^2$ is given by



at: A, I »\ 't/l ,;£:A„ ..'t: A „ I ' 



JGAi I- -'^"-'^1 igA2 igAs '""'^ * 



a|)| -/ii 
X JJ(l-e)fi:e^' X JJ(l-e)(l-fi:£^'). (23) 

igA4 J6A5 

Since $\varepsilon(t)$ is strictly decreasing, there is $t_0 > 1$ such that $t_0$ is the first time when

F(f) rftV^^^') 

(1 - e(t))(l - «:e(t)^') > ^, 1 - .(t) > -^^^^ (24) 

holds. Note that the existence of e satisfying ((24)) is guaranteed from the condition dH). For all 
t > to, we have 

n%^W>(^)^ (25) 

The remaining part of the proof is the same as in [7] and is omitted in this paper. ■

We are now ready to prove Theorem 1. From Lemma 6, the distribution $\mu(\varepsilon(t))$ converges to the unique distribution $\mu^*$ from any initial state. In addition, we also have $\mu^* = \mu(0) = \lim_{\varepsilon \to 0} \mu(\varepsilon)$ from $\lim_{t \to \infty} \varepsilon(t) = 0$. We have already proved from Propositions 1 and 3 that any state $z$ satisfying $\mu_z(0) > 0$ must be included in $\mathrm{diag}(C(\Gamma))$. Therefore,

$$\lim_{t \to \infty} \mathrm{Prob}[z(t) \in \mathrm{diag}(C(\Gamma))] = 1$$

is proved, which completes the proof of Theorem 1. Theorem 2 is also proved from Proposition 1, Lemma 1 and Proposition 3.

V. Application to Sensor Coverage Problem 

In this section, we demonstrate the effectiveness of the proposed learning algorithm PIPIP through experiments on the sensor coverage problem investigated e.g. in [3], [4], [5], whose objective is to cover a mission space efficiently using distributed control strategies. In particular, the problem of this section is formulated based on [7] with some modifications.

A. Problem Formulation 

We suppose that the mission space to be covered is given by $Q^c \subset \mathbb{R}^2$ and that a density function $W^c(q)$, $q \in Q^c$ is defined over $Q^c$. In particular, to constitute a game in the form of the previous sections, we also prepare a discretized mission space $Q$ consisting of a finite number of points in $Q^c$. Accordingly, we also define the discretized version of the density $W(q)$, $q \in Q$ such that $W(q) = W^c(q)$ $\forall q \in Q$.

In the problem, the position of agent $i$ in the mission space $Q$ is regarded as the action $a_i$ to be determined, and hence the action set $\mathcal{A}_i$ is given by a subset of $Q$ for all $i \in \mathcal{V}$. Namely, each agent $i$ chooses his action $a_i$ from the finite set $\mathcal{A}_i \subseteq Q$ at each iteration and moves toward the corresponding point.

Suppose now that each sensor has a limited sensing radius $r_m$ and that agent $i$ located at $a_i \in Q$ may sense an event at $q \in Q$ iff $q \in \mathcal{D}(a_i) := \{q \in Q \mid \|q - a_i\| \leq r_m\}$. We also denote by $n_q(a)$ the number of agents such that $q \in \mathcal{D}(a_i)$ when agents take the joint action $a$. Then, we define the function

$$\phi(a) = \sum_{q \in Q} \sum_{l=1}^{n_q(a)} \frac{W(q)}{l}.$$

This function captures a characteristic of the sensor coverage problem: as $n_q(a)$ increases, the sensing accuracy at $q \in Q$ improves but the increment decreases. Note that the authors in [7] take account of the energy consumption of sensors in addition to coverage performance and claim that the function $\phi$ cannot be a performance measure. However, we do not consider the energy consumption, and the best selection of the performance measure depends on the subjective views of designers. We thus identify maximization of $\phi$ with the global objective of the group, letting $\phi$ be the potential function.
Let us now introduce the utility function

$$U_i(a) = \sum_{q \in \mathcal{D}(a_i)} \frac{W(q)}{n_q(a)}.$$

Then, the potential game condition holds for the above potential function $\phi$ and hence a potential game is constituted. It is also easy to confirm that the utility $U_i(a)$ can be locally computed if we assume feedback of $W(q)$, $q \in \mathcal{D}(a_i)$ from the environment and of the selected actions $a_j$, $j \neq i$ only from neighboring agents specified by the $2r_m$-disk proximity communication graph.
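The coverage potential and the utilities described above can be evaluated directly on the discretized mission space. The following Python sketch is a hedged illustration: the harmonic form of the potential and the shared-density form of the utility are assumed, the grid and sensing radius follow the experimental setup of this section, and the function `density` is a hypothetical stand-in rather than the density used in the experiments. It checks numerically that a unilateral action change alters the agent's utility and the potential by the same amount.

```python
import math

R_M = 0.3   # sensing radius r_m [m]
TOL = 1e-9  # guards the distance test against floating-point rounding

# Discretized mission space Q: centers of 9 x 6 squares of side 0.3 m.
Q = [(0.15 + 0.3 * j, 0.15 + 0.3 * l) for j in range(9) for l in range(6)]


def density(q, mu=(1.95, 1.35)):
    """Hypothetical density W(q); an assumption made for this sketch."""
    return math.exp(-math.dist(q, mu) ** 2)


def senses(a_i, q):
    """True if an agent located at a_i can sense the point q."""
    return math.dist(a_i, q) <= R_M + TOL


def n_q(q, a):
    """Number of agents that sense q under the joint action a."""
    return sum(senses(a_i, q) for a_i in a)


def potential(a):
    """Sum over q of W(q) * (1 + 1/2 + ... + 1/n_q(a))."""
    return sum(density(q) / l for q in Q for l in range(1, n_q(q, a) + 1))


def utility(i, a):
    """Agent i receives W(q) / n_q(a) from every point it senses."""
    return sum(density(q) / n_q(q, a) for q in Q if senses(a[i], q))


a = [(0.15, 0.15), (0.15, 0.45), (0.45, 0.15), (0.45, 0.45)]
a_new = a[:3] + [(0.75, 0.75)]  # only agent 3 (0-based index) moves

lhs = utility(3, a_new) - utility(3, a)
rhs = potential(a_new) - potential(a)
print(math.isclose(lhs, rhs, abs_tol=1e-12))
```

Note that `utility` only needs the density at points the agent senses and the positions of agents whose sensing disks overlap its own, which is the locality property mentioned in the text.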

B. Objectives 

In this section, we run two experiments whose objectives are listed below. 
• Demonstration of effectiveness: Theorems 1 and 2 guarantee the behavior only after an infinitely long time, but in practice the algorithm must work in finite time. The first objective is thus to check whether the agents successfully cover the mission space (i) even in the presence of constraints such as obstacles and mobility constraints, and (ii) in the absence of prior information on the density function. We also compare the performance of PIPIP with that of the learning algorithm DISL, which is chosen to ensure a fair comparison. Indeed, the other existing algorithms require prior knowledge of the density, free motion without constraints, or both.
• Adaptability to environmental changes: In many real applications of sensor coverage schemes, sensors are required to change their configuration according to the surrounding environment. In particular, the density function can be time-varying, e.g. in scenarios such as measuring the radiation level in the air or sampling some chemical material and temperature in the ocean. Payoff-based algorithms are expected to adapt naturally to such environmental changes without altering the action selection rules and without any complicated decision-making processes, because no prior knowledge of the environment other than $\mathcal{A}_i$ is assumed. We thus check this capability by using a Gaussian density function whose mean moves as time advances.

Fig. 3. Mobile Robot

C. Experimental System 

In the experiments, we use four mobile robots with four wheels which can move in any direction (Fig. 3). Fig. 4 shows the schematic of the experimental system. A camera (Firefly MV (ViewPLUS Inc.) with lenses LTV2Z3314CS-IR (Raymax Inc.)) is mounted over the field. The image information is sent to a PC and processed to extract the poses of the robots by the image processing library OpenCV 2.0. Note that a board with two colored feature points is attached to each robot as in Fig. 3 to help the extraction. According to the extracted poses, the actions to be taken by the agents are computed based on the learning algorithms. However, in the experiments, the selected actions are not executed directly since collisions among robots must be avoided. For this purpose, a local decision-making mechanism checks whether collisions would occur if the selected actions were executed. The mechanism is designed based on heuristics and we omit the details since they are not essential. If the mechanism predicts a collision, the agents decide to stay at the current position. Otherwise, the selected actions are sent as reference positions together with the current poses to the local velocity and position PI controller implemented on a digital signal processor DS-1104 (dSPACE Inc.). Then, the eventual velocity command is sent to each robot via a wireless communication device XBee (Digi International Inc.).

Fig. 4. Experimental Schematic

Fig. 5. Setting of Experiment 1

The following setup is common to all experiments. The mission space $Q^c := [0, 2.7]\,\mathrm{m} \times [0, 1.8]\,\mathrm{m}$ is divided into $9 \times 6$ squares with side length 0.3 m as in Fig. 5, letting the discretized set $Q$ be given by the centers of the squares as

$$Q = \{(0.15 + 0.3j,\ 0.15 + 0.3l) \mid j \in \{0, \cdots, 8\},\ l \in \{0, \cdots, 5\}\}.$$

The sensing radius $r_m$ is set as $r_m = 0.3$ m for all robots. We also assume that each agent has a mobility constraint

$$\mathcal{R}_i(a_i) = \{a_i + 0.3(b_1, b_2) \in \mathcal{A}_i \mid b_1 \in \{-1, 0, 1\},\ b_2 \in \{-1, 0, 1\}\}.$$

The initial actions of the agents are set as

$$a_1(0) = (0.15, 0.15),\ a_2(0) = (0.15, 0.45),\ a_3(0) = (0.45, 0.15),\ a_4(0) = (0.45, 0.45).$$

Fig. 6. Configurations by DISL (Experiment 1)
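The discretized mission space and the mobility constraint are simple to enumerate. The Python sketch below is one possible encoding, not the code used in the experiments: cells are indexed by integer pairs $(j, l)$ to avoid floating-point comparisons, and the names `position`, `ALL_CELLS` and `constrained_actions`, as well as the blocked cell in the usage lines, are hypothetical.

```python
SIDE = 0.3     # side length of each square [m]
NX, NY = 9, 6  # number of squares along each axis


def position(cell):
    """Center of the square with integer index cell = (j, l), in meters."""
    j, l = cell
    return (0.15 + SIDE * j, 0.15 + SIDE * l)


# Index set of the discretized mission space Q.
ALL_CELLS = {(j, l) for j in range(NX) for l in range(NY)}


def constrained_actions(cell, action_set):
    """Cells reachable in one step: the current cell and its up to eight
    neighbors, restricted to the admissible action set."""
    j, l = cell
    moves = {(j + b1, l + b2) for b1 in (-1, 0, 1) for b2 in (-1, 0, 1)}
    return moves & action_set


# Usage: a corner cell, an interior cell, and an interior cell next to a
# hypothetical blocked cell that is removed from the action set.
print(len(constrained_actions((0, 0), ALL_CELLS)))
print(len(constrained_actions((4, 3), ALL_CELLS)))
print(len(constrained_actions((4, 3), ALL_CELLS - {(3, 3)})))
```

Removing cells from `action_set` is how obstacles would enter this encoding, since a blocked cell then never appears among the admissible moves.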

D. Experiment 1 

In the first experiment, we demonstrate the effectiveness of PIPIP. For this purpose, we employ 
the density function 

W{q) = e 9 — , fi = (1.95, 1.35) 

and prepare obstacles at 

O := {(0.75, 1.35), (1.05, 1.05), (1.35, 0.75), (1.65, 0.45)}. (26) 

Namely, in the experiment, the action sets are given by $\mathcal{A}_i = Q \setminus O$. The setup is illustrated in Fig. 5, where the region with high density is colored yellow and the red cross marks indicate the actions prohibited by the obstacles. In this situation, there exist some undesirable Nash equilibria just to the left of the obstacles. It should also be noted that each robot does not know the function $W(q)$ a priori.

Fig. 7. Configurations by PIPIP (Experiment 1)

We first run DISL under the above situation with the exploration rate $\varepsilon = 0.15$. The resulting configurations at 0, 150, 300, 450, 600 and 700 steps are shown in Fig. 6. Under this setting, three robots cannot reach the colored region within 700 steps. It is easily confirmed that the configurations at 600 and 700 steps are Nash equilibria for these three robots, and hence none of them can increase its utility by changing its own action alone.

We next run PIPIP with the parameter $\varepsilon$ fixed as $\varepsilon = 0.15$ and with $\kappa = 0.5$ (namely, PHPIP is actually run in the experiment). Fig. 7 shows the resulting configurations at the same steps as Fig. 6. Surprisingly, we see that all the robots eventually avoid the obstacles and arrive at the colored region although they initially do not know which region is important. Such a behavior is never achieved by conventional coverage control schemes. The time responses of the potential function $\phi$ for PIPIP and DISL are illustrated in Fig. 8, where the solid line shows the response for PIPIP and the dashed line that for DISL. As is apparent from the above investigations, PIPIP achieves a higher potential function value than DISL.

Fig. 8. Time Evolution of Potential Function for $\varepsilon = 0.15$ (Experiment 1)

Fig. 9. Time Evolution of Potential Function for $\varepsilon = 0.3$ (Experiment 1)

Though we can show only one sample due to the page constraints, similar results are obtained for both DISL and PIPIP through several trials. From the results, we claim that PIPIP has a stronger tendency to escape undesirable Nash equilibria than DISL, which is consistent with the role of the irrational decisions. Of course, the results strongly depend on the value of the exploration rate $\varepsilon$. We thus show the time evolution of the function $\phi$ for $\varepsilon = 0.3$ in Fig. 9. We see from Fig. 9 that some agents executing DISL still do not reach the important region even for $\varepsilon = 0.3$, which is quite a high probability for an exploration rate. Moreover, the responses fluctuate considerably, and an agent executing PIPIP crosses the obstacles again and leaves the colored region. From all the above results, we can state that guaranteeing only convergence to Nash equilibria can be a significant problem not only from the theoretical point of view but also from the practical viewpoint. Though much more thorough comparisons are necessary in order to claim the superiority of PIPIP over DISL with confidence, PIPIP achieves a better performance than DISL at least in this setup.

E. Experiment 2 

We next demonstrate the adaptability of PIPIP to environmental changes, where we remove the obstacles $O$ and hence $\mathcal{A}_i = Q$. In the experiment, we use the following Gaussian density function whose mean gradually moves.

It is worth noting that agents select actions without using any prior information on the density.

Figs. 10 and 11 respectively illustrate the resulting configurations at 0, 200, 400, 600, 800 and 1000 steps and the time evolution of the potential function $\phi$. We see from Fig. 10 that the agents gather around the most important region at every time instant while learning the environmental changes. Fig. 11 also shows that the potential function stays at almost the same level throughout the experiment, which indicates that the agents successfully track the most important region. From these results, as expected, agents executing PIPIP successfully adapt to the environmental changes without changing the action selection rule at all. Such a behavior is also never achieved by conventional coverage control schemes.

VI. Conclusion 

In this paper, we have developed a new learning algorithm, Payoff-based Inhomogeneous Partially Irrational Play (PIPIP), for potential game theoretic cooperative control of multi-agent systems. The present algorithm is based on Distributed Inhomogeneous Synchronous Learning (DISL) presented in [7] and inherits several desirable features of DISL. However, unlike DISL, PIPIP allows agents to make irrational decisions, that is, to take the action giving the lower utility of the past two actions. Thanks to these decisions, we have succeeded in proving convergence of the joint action to the potential function maximizers while escaping from undesirable Nash equilibria. Then, we have demonstrated the utility of PIPIP through experiments on a sensor coverage problem. The demonstration has revealed that the present learning algorithm works even in a finite-time interval and that agents successfully arrive near the optimal Nash equilibria in the presence of obstacles in the mission space. In addition, we have also seen through an experiment with a moving density function that PIPIP adapts to environmental changes, which is a capability expected of payoff-based learning algorithms.